Simple Linear Regression
Introduction

This page covers the basics of linear regression, focusing mainly on continuous outcomes. Linear regression is widely used to assess how predictors (also called independent variables or covariates) are related to an outcome measure.

Example

We will be using Ronald Fisher's Iris data set [1]. It contains various measurements on iris flowers of different species.

One Continuous Predictor

Before selecting a model, especially for a linear regression, it is helpful to look at a scatterplot. For example, if we were looking to predict petal length from sepal length, we would look at the following plot:

From the plot we can see that there is an approximately linear trend relating sepal length and petal length. Using statistical software, we can fit a linear model to these data; the fitted line is also commonly referred to as a trend line or line of best fit. Typical results from such a model may look like this:

The Intercept term here is not especially meaningful: it tells us that when sepal length is 0 cm, the predicted petal length is approximately -7.10 cm, a value that is not physically possible. However, the coefficient for Sepal Length is meaningful. It tells us that for every one-centimeter increase in Sepal Length, there is a predicted increase of approximately 1.85 centimeters in Petal Length.

The "Std. Error" column measures how precise each estimate is; a smaller standard error means a more precise estimate. The "t value" column gives the statistic from the t-test of whether the coefficient is 0; it is the estimate divided by its standard error. Usually the column of most interest is the one labeled "Pr(>|t|)", commonly referred to as the p-value. It tests whether the coefficient is significantly different from 0, in other words, whether the predictor is associated with the response. By convention, a p-value of less than 0.05 is taken to indicate a significant association. In this example the p-value for the Sepal Length term is extremely small (< 2 x 10^-16).
This indicates that Sepal Length is significantly associated with Petal Length.

Another relevant quantity is Multiple R-squared. For this model, the R-squared value indicates that 76% of the variability in Petal Length can be explained by Sepal Length.

Plot with model fitted to data:

One Continuous Predictor, One Categorical Predictor

In the previous example, we saw that using Sepal Length alone to predict Petal Length works reasonably well. However, the scatterplot shows some clusters, that is, groups of observations that lie closer to each other than to the rest of the data. In fact, this clustering may be due to the presence of different species in the sample, as seen in the plot below:

Perhaps we can improve on the previous model by incorporating information about species. We can start by including species as a "main effect". Graphically, this means that we think three different lines, one for each species, would be appropriate for the data. The lines have a different intercept for each species, but they all share the same slope. The model results would be the following:

With this model, the coefficients have a slightly different interpretation. The Sepal Length coefficient still corresponds to the predicted increase in Petal Length for every one-centimeter increase in Sepal Length, but now after controlling for Species. The Intercept term now corresponds to the intercept for the reference group, which in this case is the setosa species. The Speciesversicolor term corresponds to the difference in intercept between the versicolor species and the reference setosa group. A similar interpretation holds for the virginica species.
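The main-effect model just described, with one shared slope and a separate intercept per group, can be sketched in plain Python. This is only an illustration under stated assumptions: the numbers are hypothetical toy values (not the Iris measurements), and the names `fit_common_slope`, `reference`, and `offsets` are made up for this example. It uses the fact that, for this model, the least-squares slope is the pooled within-group cross-product divided by the pooled within-group sum of squares.

```python
from collections import defaultdict

# Hypothetical toy data (NOT the Iris measurements): x, y, and a group label.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 4, 4, 6, 7, 11, 10, 12, 14]
group = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]


def fit_common_slope(x, y, group):
    """Least-squares fit of y = intercept[group] + slope * x (one shared slope)."""
    points = defaultdict(list)
    for xi, yi, gi in zip(x, y, group):
        points[gi].append((xi, yi))

    means = {}
    sxx = 0.0  # pooled within-group sum of squares of x
    sxy = 0.0  # pooled within-group cross-products of x and y
    for g, pts in points.items():
        mean_x = sum(px for px, _ in pts) / len(pts)
        mean_y = sum(py for _, py in pts) / len(pts)
        means[g] = (mean_x, mean_y)
        sxx += sum((px - mean_x) ** 2 for px, _ in pts)
        sxy += sum((px - mean_x) * (py - mean_y) for px, py in pts)

    slope = sxy / sxx
    # Each group's line passes through that group's own mean point.
    intercepts = {g: my - slope * mx for g, (mx, my) in means.items()}
    return slope, intercepts


slope, intercepts = fit_common_slope(x, y, group)

# Re-express the intercepts the way a regression table reports them:
# the reference group's intercept, then each other group's difference from it.
reference = "A"
offsets = {g: a - intercepts[reference]
           for g, a in intercepts.items() if g != reference}

print("shared slope:", slope)
print("reference intercept:", intercepts[reference])
print("intercept differences:", offsets)
```

Here group "A" plays the role of the reference group, and the entries of `offsets` correspond to the species terms in the model output: each is the difference between that group's intercept and the reference intercept.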
The Speciesversicolor and Speciesvirginica p-values test whether the intercept for each of these species is significantly different from the intercept for the reference setosa group; in both cases the p-value is highly significant.

When a model has more than one term, Adjusted R-squared is the more appropriate measure. It is similar to the ordinary R-squared, but applies a penalty for adding more terms to the model. Here the adjusted R-squared is substantially higher, meaning that although this is a more complicated model, it explains considerably more variability than the earlier model that only considered Sepal Length. Below is a graphical representation of the model, which shows a much better fit to the data.

One Continuous Predictor, One Categorical Predictor with Interaction

Suppose that we think the effect of Sepal Length on Petal Length differs by species, in other words, that there is an interaction effect. This would correspond to a model with 3 different intercepts and 3 different slopes:

The Intercept, Speciesversicolor, and Speciesvirginica terms have the same interpretation as before. However, Sepal.Length here refers only to the slope for the reference setosa group. Sepal.Length:Speciesversicolor is the difference between the versicolor slope and the setosa slope; this is known as the interaction effect. The difference in the virginica slope is likewise measured relative to the setosa slope.

We also see a modest increase in adjusted R-squared for this model (0.9781 vs. 0.9744). This means that the model explains slightly more variability even after the penalty for its additional terms.

Plot of the interaction model:
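As a supplementary sketch of the interaction model and of the adjusted R-squared penalty, the following plain-Python example may help. It rests on assumptions: the data are hypothetical toy values (not the Iris measurements), and `fit_line` and `adjusted_r_squared` are names invented for this illustration. With a full interaction, every group has its own intercept and slope, so the fitted lines coincide with a separate least-squares line per group; only the point estimates are reproduced this way, not the standard errors or p-values.

```python
# Hypothetical toy data (NOT the Iris measurements), three points per group.
data = {
    "A": ([1, 2, 3], [1, 4, 4]),
    "B": ([4, 5, 6], [6, 7, 11]),
    "C": ([7, 8, 9], [10, 12, 14]),
}


def fit_line(x, y):
    """Ordinary least-squares line; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def adjusted_r_squared(r_squared, n_obs, n_terms):
    """Penalize R-squared for the number of non-intercept terms in the model."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_terms - 1)


# One independent line per group gives the interaction model's fitted lines.
fits = {g: fit_line(x, y) for g, (x, y) in data.items()}

# Report each group relative to the reference group "A", mirroring how the
# group and interaction terms appear in a regression table.
ref_intercept, ref_slope = fits["A"]
for g, (intercept, slope) in fits.items():
    print(g, "intercept diff:", intercept - ref_intercept,
          "slope diff:", slope - ref_slope)
```

For the penalty itself, `adjusted_r_squared(0.76, 150, 1)` applies the formula to the R-squared of the single-predictor model, assuming the usual 150-flower version of the data set; the result is only slightly below 0.76 because one term among 150 observations carries a small penalty. Adding terms that do not raise R-squared would lower the adjusted value.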